Abstract

We present a deep layered architecture that generalizes convolutional neural networks (ConvNets). The architecture, called SimNets, is driven by two operators: (i) a similarity function that generalizes inner-product, and (ii) a log-mean-exp function called MEX that generalizes maximum and average. The two operators applied in succession give rise to a standard neuron but in “feature space”. The feature spaces realized by SimNets depend on the choice of the similarity operator. The simplest setting, which corresponds to a convolution, realizes the feature space of the Exponential kernel, while other settings realize feature spaces of more powerful kernels (Generalized Gaussian, which includes as special cases RBF and Laplacian), or even dynamically learned feature spaces (Generalized Multiple Kernel Learning). As a result, the SimNet contains a higher abstraction level compared to a traditional ConvNet. We argue that enhanced expressiveness is important when the networks are small due to run-time constraints (such as those imposed by mobile applications). Empirical evaluation validates the superior expressiveness of SimNets, showing a significant gain in accuracy over ConvNets when computational resources at run-time are limited. We also show that in large-scale settings, where computational complexity is less of a concern, the additional capacity of SimNets can be controlled with proper regularization, yielding accuracies comparable to state of the art ConvNets.

1. Introduction 

Deep neural networks, and convolutional neural networks (ConvNets) in particular, have had a dramatic impact in advancing the state of the art in computer vision, speech analysis, and many other domains. It has been demonstrated time and again that when ConvNets are trained in an end-to-end manner, they deliver significantly better results than systems relying on manually engineered features.

The goal of this paper is to introduce a generalization of ConvNets we call Similarity Networks (SimNets), that preserves the simplicity and effectiveness of ConvNets, yet has a higher abstraction level. In a nutshell, the inner-product operator, which lies at the core of the ConvNet architecture, is replaced by an inner-product in “feature space”. The feature spaces are controlled by a family of kernel functions which include in particular the conventional (linear) inner-product as a special case.

We argue that the incentive for designing deep networks with a higher abstraction level than ConvNets arises from the need for small networks that could fit into mobile platforms in terms of space and run-time. With small networks the approximation error becomes a limiting factor, which could be ameliorated through network architectures that are based on a higher level of abstraction.

The SimNet architecture is based on two operators. The first is analogous to, and generalizes, the inner-product operator of neural networks. The second, as special cases, plays the role of non-linear activation and pooling, but has additional capabilities that take SimNets far beyond ConvNets. In a detailed set of experiments, the SimNet architecture achieves state of the art accuracy using networks with complexity comparable to that of top performing ConvNets. However, when network complexity is limited, SimNets deliver a significant boost in accuracy.

Recently, the task of reducing run-time complexity of ConvNets has been receiving increased attention. For example, a method named FitNets ([29]), based on the knowledge distillation principle, has been suggested in order to assist in compressing deep networks. In [34], a form of gating inspired by Long Short-Term Memory recurrent networks is introduced, allowing training of very deep and narrow networks. Another line of work considers imposing structural constraints on network weights, such as sparsity, in order to improve run-time efficiency. Alternatively, network weights may be factorized using matrix or tensor decompositions, reducing storage and computational complexity, at the expense of marginal deterioration in accuracy. All of these approaches consider ConvNets (or neural networks) as a baseline, and use supplementary techniques to reduce run-time complexity. In this work, we propose the alternative (generalized) SimNet architecture, and argue that it is inherently more efficient than ConvNets. The techniques listed here for reducing run-time complexity of ConvNets could just as well be applied to SimNets, thereby resulting in even more computationally efficient models.

2. The SimNet architecture

A feed-forward fully-connected neural network, also known as a multilayer perceptron (MLP), is based on a single operator. Given $x \in \mathbb{R}^d$ as input to a layer of neurons, the output of the $r$'th neuron in the layer is $\sigma(w_r^\top x + b_r)$, where $\sigma(\cdot)$ is a non-linear activation function. An MLP is constructed by forward chaining the input/output operation to create a layered network. The learned parameters of the network are the weight vectors $w_r$ and biases $b_r$, per neuron.

The SimNet architecture consists of two operators. The first operator is a weighted similarity function between an input $x \in \mathbb{R}^d$ and a template $z \in \mathbb{R}^d$:

$$\text{similarity operator}: \quad u^\top \phi(x, z)$$

where $u \in \mathbb{R}^d_+$ is a weight vector and $\phi : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ is a point-wise similarity mapping. We consider two forms of similarity mappings: the “linear” form $\phi_{lin}(x, z)_i = x_i z_i$, and the “$\ell_p$” form $\phi_{\ell_p}(x, z)_i = -|x_i - z_i|^p$ defined for $p > 0$. Note that when setting $u = \mathbf{1}$, the corresponding similarities reduce to the inner-product and the (negated) $\ell_p$ distance raised to the power of $p$, respectively. Note also that unlike the MLP operator, the similarity does not include a bias term. This functionality is covered, in a much more general sense, by the second operator described below.

For the second SimNet operator we define MEX - a log-mean-exp function:

$$\mathrm{MEX}_\beta\{c_i\}_{i=1,\dots,n} := \frac{1}{\beta} \log\left( \frac{1}{n} \sum_{i=1}^{n} \exp\{\beta \cdot c_i\} \right) \qquad (1)$$

The parameter $\beta \in \mathbb{R}$ spans a continuum between maximum ($\beta \to +\infty$), average ($\beta \to 0$) and minimum ($\beta \to -\infty$), and for a fixed value of $\beta$ the function is smooth and exhibits the following “collapsing” property:^1

$$\mathrm{MEX}_\beta\left\{ \mathrm{MEX}_\beta\{c_{ij}\}_{1 \le j \le m} \right\}_{1 \le i \le n} = \mathrm{MEX}_\beta\{c_{ij}\}_{1 \le j \le m,\, 1 \le i \le n}$$

Given the definition in eqn. (1), the second SimNet operator consists of taking MEX over an input $x \in \mathbb{R}^d$ with a bias vector $b \in \mathbb{R}^d$ - one per input coordinate:^2

$$\text{MEX operator}: \quad \mathrm{MEX}_\beta\{x_i + b_i\}_{i=1,\dots,d}$$

Note that unlike a conventional MLP unit, which has a bias scalar, a MEX unit has a vector of biases. We may choose to omit part or all of the biases as part of a network design. For example, when all biases are dropped the MEX operator implements a soft trade-off between maximum and average.

3. SimNet MLP

A SimNet analogy of an MLP with a single hidden layer is obtained by applying the two operators defined in sec. 2 one after the other - similarity followed by MEX. The resulting network is illustrated in fig. 1(a). It includes $n$ hidden similarity units corresponding to weighted templates $\{(z_l, u_l)\}_{l=1}^n$, and $k$ output MEX units associated with bias vectors $\{b_r\}_{r=1}^k$. Denote by $h_r(x)$ the value of the $r$'th output unit when the network is fed with input $x \in \mathbb{R}^d$:^3

$$h_r(x) := \mathrm{MEX}_\beta\{u_l^\top \phi(x, z_l) + b_{rl}\}_{l=1,\dots,n}$$

When used as a classifier of $x$ into one of $k$ categories, the network predicts the label $r$ for which $h_r(x)$ is maximal:

$$\hat{y}(x) = \arg\max_{r=1,\dots,k} \mathrm{MEX}_\beta\{u_l^\top \phi(x, z_l) + b_{rl}\}_{l=1,\dots,n}$$

As it turns out, SimNet MLP is closely related to kernel machines. In particular, with linear similarity, i.e. with the inner-product operator on which neural networks are based, it is a support vector machine (SVM) based on the Exponential kernel. Replacing the linear similarity with ℓp boosts the abstraction level of SimNet MLP, by lifting it to a Generalized Multiple Kernel Learning (GMKL, [37]) engine with a Generalized Gaussian kernel. The remainder of this section provides the details.

^1 The collapsing property, as well as smoothly generalizing maximum and average, will prove to be essential for us. We are not aware of other functions that meet these three requirements. Specifically, the common softmax function $\frac{1}{\beta}\log\left(\sum_i \exp\{\beta \cdot c_i\}\right)$ collapses and generalizes maximum but does not generalize average, and the alternative softmax function $\sum_i c_i e^{\beta c_i} / \sum_i e^{\beta c_i}$ generalizes maximum and average but does not collapse.

^2 The MEX operator can be viewed as an “inner-product in log-space”. More accurately, if $x$ and $b$ are log-space representations of two vectors $\mathbf{c}$ and $\mathbf{d}$ respectively (i.e. $x_i = \log c_i$ and $b_i = \log d_i$), then $\mathrm{MEX}_{\beta=1}\{x_i + b_i\}_i = \log\langle \mathbf{c}, \mathbf{d} \rangle - \log d$, where the scalar $d$ is the dimension. In words, the MEX operator (with $\beta = 1$) taken over the log-space representations of $\mathbf{c}$ and $\mathbf{d}$ is equal (up to an additive constant) to the log-space representation of their inner-product.

^3 Note that with uniform weights ($u_l = \mathbf{1}$), linear similarity mapping $\phi_{lin}$ and $\beta \to +\infty$ we have $h_r(x) = \max_{l}\{z_l^\top x + b_{rl}\}$, i.e. the network outputs are “maxout” units ([13]). SimNet MLP is not the first to generalize maxout. Other generalizations have been suggested, notably the recently proposed Lp unit ([15]), which is defined by $\left(\frac{1}{n}\sum_{l=1}^n |z_l^\top x + b_{rl}|^p\right)^{1/p}$, and tends to $\max_l\{|z_l^\top x + b_{rl}|\}$ as $p \to +\infty$. The differences between SimNet MLP and Lp unit as maxout generalizations are: (i) Lp unit generalizes maximum of absolute values, which only coincides with maxout if the arguments are non-negative, and (ii) Lp unit tries to realize maxout with a single operator whereas SimNet MLP implements maxout with a succession of two operators.
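The properties of MEX used so far (the three limits of eqn. (1), the collapsing property, and the log-space inner-product view) can be checked numerically. The following is a minimal sketch rather than the authors' implementation; the function name `mex`, the max-shift used for numerical stability, and the test values are assumptions made for illustration.

```python
import numpy as np

def mex(c, beta, axis=None):
    """MEX_beta{c_i} = (1/beta) * log(mean_i exp(beta * c_i)), computed stably."""
    c = np.asarray(c, dtype=float)
    if beta == 0.0:
        return c.mean(axis=axis)  # the beta -> 0 limit is the plain average
    s = beta * c
    m = s.max(axis=axis, keepdims=True)  # shift by the max to avoid overflow
    log_mean = m + np.log(np.mean(np.exp(s - m), axis=axis, keepdims=True))
    return np.squeeze(log_mean, axis=axis) / beta

c = np.array([[1.0, 2.0, 3.0],
              [4.0, 0.0, -1.0]])

# Limits of eqn. (1): large beta -> maximum, small beta -> average,
# very negative beta -> minimum.
near_max, near_avg, near_min = mex(c, 1e3), mex(c, 1e-6), mex(c, -1e3)

# Collapsing: MEX of row-wise MEX values equals a single MEX over all entries.
nested = mex(mex(c, 0.7, axis=1), 0.7)
flat = mex(c, 0.7)

# Log-space inner product (beta = 1):
# MEX{log c_i + log d_i} = log<c, d> - log(dimension).
cv, dv = np.array([0.5, 2.0, 1.5]), np.array([3.0, 0.1, 1.0])
log_space = mex(np.log(cv) + np.log(dv), 1.0)
direct = np.log(cv @ dv) - np.log(len(cv))
```

Note that collapsing relies on the inner sets having equal size, which is the case for the row-wise reduction above.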

SimNet MLP outputs can be written as:

$$h_r(x) = \mathrm{MEX}_\beta\{u_l^\top \phi(x, z_l) + b_{rl}\}_{l=1}^n = \frac{1}{\beta} \ln\left( \frac{1}{n} \sum_{l=1}^{n} \alpha_{rl} \exp\left\{ \beta \sum_{i=1}^{d} u_{l,i}\, \phi(x, z_l)_i \right\} \right) = \sigma\left( \sum_{l=1}^{n} \alpha_{rl} \cdot K_\theta(x, z_l) \right)$$

where $\alpha_{rl} := \exp\{\beta b_{rl}\}$, $\theta = (\phi, u)$, and $\sigma(t) = (1/\beta)\ln(t/n)$ is a non-linear activation function. The mapping $K_\theta$ for the linear and ℓp similarities takes the following forms:

$$K_{lin}(x, z) = \exp\{\beta x^\top z\}$$

$$K_{\ell_p}(x, z_l) = \exp\left\{ -\beta \sum_{i=1}^{d} u_{l,i} |x_i - z_{l,i}|^p \right\}$$

$K_{lin}$ is known as the Exponential kernel, and $K_{\ell_p}$ is a GMKL. Specifically, fixing uniform weights ($u_l = \mathbf{1}$) and $p \le 2$ reduces $K_{\ell_p}$ to what is known as the Generalized Gaussian kernel. For the particular cases $p = 2$ and $p = 1$ we get the radial basis function (RBF) and Laplacian kernels respectively. When the weights $u_l$ and/or order $p$ are learned, the exact underlying kernel is selected during training and we arrive at a GMKL.
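As a sanity check on the rewriting of the SimNet MLP output as an activation applied to a weighted sum of kernel evaluations, the sketch below computes the output both ways and compares them. It is an illustrative sketch only; the function names (`kernel`, `h_direct`, `h_kernel`) and the array layout (templates `Z` and weights `U` of shape `(n, d)`) are assumptions, and the linear kernel is written with a general weight vector, which reduces to exp{β x^T z} when the weights are all ones.

```python
import numpy as np

def kernel(x, z, u, beta, p=None):
    """exp(beta * sum_i u_i x_i z_i) if p is None, else exp(-beta * sum_i u_i |x_i - z_i|^p)."""
    phi = x * z if p is None else -np.abs(x - z) ** p
    return np.exp(beta * np.dot(u, phi))

def h_direct(x, Z, U, b_r, beta, p=None):
    """Output unit as MEX over templates: MEX_beta{u_l^T phi(x, z_l) + b_rl}."""
    phi = x * Z if p is None else -np.abs(x - Z) ** p   # (n, d) point-wise similarities
    s = np.sum(U * phi, axis=1) + b_r                    # (n,) weighted similarities + offsets
    return np.log(np.mean(np.exp(beta * s))) / beta

def h_kernel(x, Z, U, b_r, beta, p=None):
    """Same output written as sigma(sum_l alpha_rl K(x, z_l)), sigma(t) = ln(t / n) / beta."""
    n = len(Z)
    K = np.array([kernel(x, Z[l], U[l], beta, p) for l in range(n)])
    alpha = np.exp(beta * b_r)
    return np.log(np.dot(alpha, K) / n) / beta
```

With `U` set to all ones and `p=2`, `kernel` evaluates to `exp(-beta * ||x - z||^2)`, i.e. the RBF special case mentioned above.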

Denoting by $\psi_\theta$ a feature mapping associated with $K_\theta$, we get:

$$h_r(x) = \sigma\left( \langle \psi_\theta(x), w_r \rangle \right)$$

where $w_r := \sum_{l=1}^{n} \alpha_{rl}\, \psi_\theta(z_l)$ is a learned vector in feature space. We thus conclude that SimNet MLP output units are “neurons in feature space”, where the space corresponds to the Exponential kernel in the case of linear similarity, and to the Generalized Gaussian kernel in the case of ℓp similarity with fixed weights $u_l$ and order $p$. When the weights and/or order are learned, the feature space is selected during training, which is equivalent to saying that SimNet MLP is a GMKL.

One may ask whether a different choice of kernel, more elaborate than Generalized Gaussian, would suffice in order to capture SimNet MLP with ℓp similarity and learned weights as a simple kernel machine. As theorem 1 (proven in [8]) shows, such a kernel does not exist, i.e. a GMKL is indeed necessary in order to represent SimNet MLP in all its glory.

Theorem 1. For any dimension $d \in \mathbb{N}$, and constants $c > 0$ and $p > 0$, there are no mappings $Z : \mathbb{R}^d \to \mathbb{R}^d$ and $U : \mathbb{R}^d \to \mathbb{R}^d_+$ and a kernel $K : (\mathbb{R}^d \times \mathbb{R}^d_+) \times (\mathbb{R}^d \times \mathbb{R}^d_+) \to \mathbb{R}$ such that for all $z, x \in \mathbb{R}^d$ and $u \in \mathbb{R}^d_+$:

$$K\left( [Z(x), U(x)], [z, u] \right) = \exp\left\{ -c \sum_{i=1}^{d} u_i |x_i - z_i|^p \right\}.$$


4. Deep SimNets for processing images 


In the previous section we presented the basic MLP version of SimNets. In this section we describe two (orthogonal) directions of extension. The first is the addition of locality, sharing and pooling for processing images (SimNet MLPConv, sec. 4.1), while the second focuses on deepening the network (adding layers) for enhanced capacity (sec. 4.3). In this context we introduce a “whitened” similarity layer through a succession of a convolution (linear similarity) followed by ℓp similarity with receptive field 1×1.


4.1. SimNet MLPConv 

The extension of SimNet MLP for processing images follows the line of the MLPConv structure suggested in [26], and we accordingly refer to it as SimNet MLPConv. In particular, [26] convolved a standard MLP across an incoming 3D array by successively applying it to patches and stacking the outputs in a spatially coherent manner. This results in a bank of feature maps, which may be summarized into prediction scores through global average pooling. SimNet MLPConv follows the same principles - a SimNet MLP is convolved across an incoming 3D array, and the resulting feature maps are summarized via global MEX pooling. An illustration of SimNet MLPConv is provided in fig. 1(c). In the figure, $x_{ij}$ refers to the input patch in location $ij$, $z_l$ and $u_l$ denote similarity templates and weights respectively (vectors of the same dimension as a patch, with $u_l$ entry-wise non-negative), $\phi$ is the similarity mapping (linear or ℓp), $\beta_1 \in \mathbb{R}$ and $b_{rl} \in \mathbb{R}$ are the MEX parameter and offsets of the underlying SimNet MLP, and $\beta_2 \in \mathbb{R}$ is the MEX parameter of the final global pooling layer.

When used to classify images, the prediction rule associated with SimNet MLPConv is given by:

$$\hat{y}(input) = \arg\max_{r} \mathrm{MEX}_{\beta_2}\left\{ \mathrm{MEX}_{\beta_1}\{u_l^\top \phi(x_{ij}, z_l) + b_{rl}\}_{l} \right\}_{ij}$$

Setting $\beta_1 = \beta_2 = \beta$, and using the collapsing property of MEX, we get a “patch-based” version of SimNet MLP's classification:

$$\hat{y}(input) = \arg\max_{r} \mathrm{MEX}_{\beta}\{u_l^\top \phi(x_{ij}, z_l) + b_{rl}\}_{l,\, ij}$$

It can be shown ([8]) that all results put forth in sec. 3 for relating SimNet MLP to kernel machines apply to SimNet MLPConv as well, but with the underlying kernels being based on “patch-representations”. In other words, SimNet MLPConv - a “patch-based” extension of SimNet MLP - maintains all kernel relations of the latter, with a “patch-based” extension of the underlying kernels.

Figure 1. (a) SimNet MLP - SimNet analogy of MLP with single hidden layer (sec. 3). (b) conv→ℓp-sim structure - implements whitened ℓp similarity (sec. 4.2). (c) SimNet MLPConv - single layer SimNet for processing images (sec. 4.1). (d) L-layer SimNet for processing images (sec. 4.3). Best viewed in color.
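The effect of the collapsing property on the SimNet MLPConv prediction rule can be illustrated numerically: classifying each location with MEX and then pooling globally with MEX gives the same class scores as a single MEX over all templates and locations, provided both MEX parameters are equal. This is a sketch under assumed shapes (a grid of `H x W` locations, `n` templates, `k` classes); the names `scores_two_stage` and `scores_collapsed` are hypothetical.

```python
import numpy as np

def mex(c, beta, axis=None):
    """Log-mean-exp: (1/beta) * log(mean(exp(beta * c))) over the given axes."""
    c = np.asarray(c, dtype=float)
    return np.log(np.mean(np.exp(beta * c), axis=axis)) / beta

def scores_two_stage(sims, b, beta1, beta2):
    """sims: (H, W, n) weighted similarities per location; b: (k, n) MEX offsets."""
    local = mex(sims[:, :, None, :] + b, beta1, axis=-1)   # (H, W, k) per-location class scores
    return mex(local, beta2, axis=(0, 1))                  # (k,) global MEX pooling

def scores_collapsed(sims, b, beta):
    """Single MEX over templates and locations together."""
    return mex(sims[:, :, None, :] + b, beta, axis=(0, 1, 3))  # (k,)

rng = np.random.default_rng(0)
sims = rng.normal(size=(4, 5, 6))   # stand-in values for u_l^T phi(x_ij, z_l)
b = rng.normal(size=(3, 6))
two_stage = scores_two_stage(sims, b, 0.9, 0.9)
collapsed = scores_collapsed(sims, b, 0.9)
predicted_label = int(np.argmax(collapsed))
```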

4.2. Whitening with convolutional layer 

We now describe a simple yet powerful addition to the ℓp similarity operator. Recall that the ℓp similarity between an input $x \in \mathbb{R}^d$ and a template $z \in \mathbb{R}^d$ with weights $u \in \mathbb{R}^d_+$ is defined by $-\sum_{i=1}^{d} u_i |x_i - z_i|^p$. Up to a constant that depends on $u$ (and $p$), this is equal to the log probability density of the input $x$ being drawn from a Generalized Gaussian distribution with independent components, shape $p$, mean $z$, and scales $u_i^{-1/p}$. These ideas are further developed in sec. 5; however, it is clear at this point that in order to capture this probabilistic model, it would be desirable for the input $x$ to have statistically independent coordinates. Common practice in such cases is to seek a matrix $W$ for which the linearly transformed vector $Wx$ has independent coordinates. This is referred to in the literature as ICA - independent component analysis. Assuming such a matrix is found, it would then be natural to “whiten” inputs, i.e. multiply them by $W$, before measuring their ℓp similarities to weighted templates. Besides better compliance with the coordinate independence assumption, this also gives rise to the possibility of dimensionality reduction. In particular, we may set the matrix $W$ to cancel out low-variance principal components of $x$, thereby producing whitened vectors of a lower dimension. This can be useful for both noise reduction and computational efficiency.

In the context of SimNet MLPConv, adding support for whitening before ℓp similarity is simple - it merely requires a convolutional layer (linear similarity) followed by an ℓp similarity layer with receptive field 1×1. Such a construct, which we refer to as conv→ℓp-sim, is illustrated in fig. 1(b). In this figure, input patches $x_{ij}$ are transformed into $d$-dimensional vectors $y_{ij}$ by a convolutional layer with $d$ filters $w_1, \dots, w_d$ that hold the rows of the whitening matrix $W$. The whitened vectors are then matched against $n$ weighted templates in the ℓp similarity layer, producing $n$ similarity maps as output. To recap, one may add whitening to ℓp similarity by replacing the similarity layer with a conv→ℓp-sim structure, which consists of convolution followed by 1×1 similarity.
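The conv→ℓp-sim construct can be sketched on a matrix of flattened patches, where the convolution becomes a matrix product with the whitening filters and the 1×1 similarity acts on each whitened vector independently. The sketch below is only an illustration: it swaps in PCA whitening as a simple stand-in for the ICA-based whitening discussed in the text, and the function names, the `eps` regularizer and the array shapes are assumptions.

```python
import numpy as np

def pca_whitening_matrix(patches, out_dim, eps=1e-5):
    """Rows of W map centered patches to decorrelated, unit-variance coordinates.

    NOTE: PCA whitening is used here as a stand-in; the text refers to ICA.
    Keeping only the top `out_dim` components cancels low-variance directions.
    """
    centered = patches - patches.mean(axis=0)
    cov = centered.T @ centered / len(patches)
    eigval, eigvec = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = np.argsort(eigval)[::-1][:out_dim]        # indices of high-variance components
    return (eigvec[:, top] / np.sqrt(eigval[top] + eps)).T   # (out_dim, patch_dim)

def conv_lp_sim(patches, W, Z, U, p):
    """Convolution with the rows of W as filters, followed by 1x1 lp similarity.

    patches: (N, patch_dim); Z, U: (n, out_dim) templates and non-negative weights.
    Returns an (N, n) array holding n similarity values per patch.
    """
    Y = patches @ W.T                                # whitened vectors y_ij
    diff = np.abs(Y[:, None, :] - Z[None, :, :]) ** p
    return -(diff * U[None, :, :]).sum(axis=-1)
```

In an actual network the patches would be extracted from an image with a sliding window, so the `(N, n)` output corresponds to `n` spatial similarity maps.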

In sec. 5 we describe how to pre-train a conv→ℓp-sim structure, and in particular how to initialize the filters so that they perform the whitening transformation they are intended for. Before that, however, we show how SimNet MLPConv can be extended into an image processing SimNet of arbitrary depth.

4.3. Going deep with SimNet MLPConv 


After laying out the basic SimNet construct (SimNet MLP - sec. 3), equipping it with spatial structure (SimNet MLPConv - sec. 4.1), and adding whitening to its ℓp similarity (conv→ℓp-sim - sec. 4.2), we are finally in a position to define an arbitrarily deep SimNet for processing images. Our starting point is SimNet MLPConv with whitened ℓp similarity. This network accounts for a single layer (conv→ℓp-sim) followed by a classifier (classification MEX and global MEX pooling). Adding depth to the network simply amounts to prepending conv→ℓp-sim layers, optionally separated by MEX pooling. A general L-layer SimNet following this architectural prescription is illustrated in fig. 1(d). In this structure, conv→ℓp-sim layers measure whitened ℓp similarities of incoming patches to weighted templates, MEX pooling operations summarize spatial regions in similarity maps by MEX'ing them together (note that both average pooling and max pooling are special cases of this), the MEX classification uses its offsets $b_{rl}$ to classify each location in the final similarity maps, and the final global MEX pooling summarizes the local classifications into global class scores. The parameters that may be learned during training are: the linear filters in conv→ℓp-sim ($w^{(1)}, \dots, w^{(L)}$); the similarity templates and weights in conv→ℓp-sim ($z^{(1)}, \dots, z^{(L)}$ and $u^{(1)}, \dots, u^{(L)}$); the similarity orders in conv→ℓp-sim ($p^{(1)}, \dots, p^{(L)}$); the MEX parameters in local pooling ($\beta^{(1)}, \dots, \beta^{(L-1)}$); the MEX parameter in classification; the MEX offsets in classification ($b_{rl}$); and the MEX parameter in global pooling. In the following section we describe methods for initializing these parameters prior to training (pre-training).
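The remark that average pooling and max pooling are both special cases of MEX pooling can be made concrete with a small sketch of MEX pooling over a single 2D similarity map. The window size, stride, absence of padding and the function name `mex_pool` are assumptions chosen for illustration.

```python
import numpy as np

def mex_pool(fmap, beta, size=3, stride=2):
    """MEX pooling of a 2D map over size x size windows (no padding, beta != 0)."""
    H, W = fmap.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            win = fmap[i * stride:i * stride + size, j * stride:j * stride + size]
            m = (beta * win).max()                  # shift for numerical stability
            out[i, j] = (m + np.log(np.mean(np.exp(beta * win - m)))) / beta
    return out

fmap = np.arange(49, dtype=float).reshape(7, 7)
almost_max = mex_pool(fmap, beta=200.0)    # close to 3x3 max pooling with stride 2
almost_avg = mex_pool(fmap, beta=1e-6)     # close to 3x3 average pooling with stride 2
```

For intermediate values of `beta` the operation interpolates smoothly between the two, which is what allows the pooling behaviour to be learned.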


5. Pre-training 


In this section we briefly describe a method for pre-training an L-layer SimNet as illustrated in fig. 1(d). Our initialization scheme covers the parameters of conv→ℓp-sim layers (linear filters $w^{(1)}, \dots, w^{(L)}$, similarity templates $z^{(1)}, \dots, z^{(L)}$, weights $u^{(1)}, \dots, u^{(L)}$ and orders $p^{(1)}, \dots, p^{(L)}$), assuming predetermined local MEX pooling parameters ($\beta^{(1)}, \dots, \beta^{(L-1)}$). Two attractive properties of the scheme are: (i) it is unsupervised (does not require any labels), and (ii) it gives rise to automatic selection of the number of channels in the convolutions and similarities of conv→ℓp-sim layers.

The initialization is applied layer by layer in a forward sweep, thus in order for it to be defined, it suffices to consider a single conv→ℓp-sim layer (fig. 1(b)). Recall from sec. 4.2 that we interpret the convolution in conv→ℓp-sim as a linear transformation that whitens (and possibly reduces the dimension of) input patches prior to similarity measurements. Accordingly, we initialize its filters $w_1, \dots, w_d$ as the rows of a whitening matrix $W$ estimated via ICA on patches.

Turning to the initialization of similarity templates ($z_1, \dots, z_n$), weights ($u_1, \dots, u_n$) and order ($p$), we recall that an ℓp similarity between an input $y \in \mathbb{R}^d$ and a template $z \in \mathbb{R}^d$ with weights $u \in \mathbb{R}^d_+$ is defined to be $-\sum_{t=1}^{d} u_t |y_t - z_t|^p$. Consider now a probability distribution over $\mathbb{R}^d$ defined by a mixture of $n$ Generalized Gaussians (with priors $\lambda_l > 0$, $\sum_l \lambda_l = 1$), all having the same shape parameter ($\beta > 0$), and each having independent coordinates with separate scales and means ($\alpha_{l,t} > 0$ and $\mu_{l,t} \in \mathbb{R}$ respectively, for coordinate $t$ of component $l$):

$$P(y) = \sum_{l=1}^{n} \lambda_l \prod_{t=1}^{d} \frac{\beta}{2\alpha_{l,t}\Gamma(1/\beta)} \exp\left\{ -\left( \frac{|y_t - \mu_{l,t}|}{\alpha_{l,t}} \right)^{\beta} \right\}$$

The log probability density of a vector drawn from this distribution being equal to $y$ and originating from component $l$ is:

$$\log P(y \wedge \text{comp. } l) = -\sum_{t=1}^{d} \alpha_{l,t}^{-\beta} |y_t - \mu_{l,t}|^{\beta} + c_l$$

where $c_l := \log\left\{ \lambda_l \prod_{t=1}^{d} \frac{\beta}{2\alpha_{l,t}\Gamma(1/\beta)} \right\}$ is a constant that does not depend on $y$. This implies that if we model whitened patches $y_{ij}$ with a Generalized Gaussian mixture as above, initializing the similarity templates via $z_{l,t} = \mu_{l,t}$, the weights via $u_{l,t} = \alpha_{l,t}^{-\beta}$ and the order via $p = \beta$ would give:

$$u_l^\top \phi_{\ell_p}(y_{ij}, z_l) = \log P(y_{ij} \wedge \text{comp. } l) - c_l$$

In words, similarity channel $l$ would hold, up to a constant, the probabilistic heat map of component $l$ and the whitened patches $y_{ij}$. This observation suggests estimating the parameters of the mixture (shape $\beta$, scales $\alpha_{l,t}$ and means $\mu_{l,t}$) based on whitened patches (via EM), and initializing the similarity parameters accordingly. We note in passing that it is possible to append additive biases $b_l$ to the similarity (through offsets of the succeeding MEX operator), in which case initializing these via $b_l = c_l$ would make the probabilistic heat maps exact (not up to a constant).
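The correspondence between the ℓp similarity and the log-density of a Generalized Gaussian mixture component can be verified numerically for a single component. The sketch below assumes the mixture parameters are already given (it does not perform the EM estimation); the function names and the example values are hypothetical.

```python
import math
import numpy as np

def log_joint(y, lam, mu, alpha, beta):
    """log P(y and component l) for one Generalized Gaussian component
    with prior lam, means mu, scales alpha (arrays) and shape beta."""
    log_norm = np.log(beta) - np.log(2.0 * alpha) - math.lgamma(1.0 / beta)
    return math.log(lam) + np.sum(log_norm - (np.abs(y - mu) / alpha) ** beta)

def mixture_constant(lam, alpha, beta):
    """c_l = log(lam * prod_t beta / (2 * alpha_t * Gamma(1 / beta)))."""
    return math.log(lam) + np.sum(
        np.log(beta) - np.log(2.0 * alpha) - math.lgamma(1.0 / beta))

def lp_similarity(y, z, u, p):
    """Weighted lp similarity: -sum_t u_t |y_t - z_t|^p."""
    return -np.sum(u * np.abs(y - z) ** p)

# Initialization suggested by the text: z = mu, u = alpha^(-beta), p = beta.
y = np.array([0.2, -1.0, 0.7])
mu = np.array([0.0, -0.5, 1.0])
alpha = np.array([1.0, 0.5, 2.0])
lam, beta = 0.3, 1.5
similarity = lp_similarity(y, mu, alpha ** (-beta), beta)
heat_value = log_joint(y, lam, mu, alpha, beta) - mixture_constant(lam, alpha, beta)
```

Here `similarity` and `heat_value` agree up to floating-point error, which is the identity stated above; adding `mixture_constant(...)` as a bias would recover the exact log-density.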

Finally, as stated above, the initialization scheme presented induces an automatic selection of the number of convolution and similarity channels in conv→ℓp-sim. The number of convolution channels corresponds to the dimension to which input patches are reduced during whitening, and thus may be set via methods for estimating the effective dimensionality of data. Similarity channels correspond to components in the mixture estimated for whitened patches, and thus may be set via methods for estimating the number of components in a mixture.


6. Experiments 

To evaluate the effectiveness of SimNets, we compared them against alternative ConvNets in three experiments of increasing complexity. In the first experiment, we ran a single layer SimNet against an equivalent single layer ConvNet, and studied the effect of model size (number of convolution/similarity channels) on the accuracy of the two networks. In a second experiment, we compared compact two layer SimNets against the best performing publicly available ConvNet we are aware of that has comparable complexity. In the third and final experiment, we constructed a large three layer SimNet designed to compete against state of the art ConvNets. Our experiments demonstrate that SimNets are significantly more accurate than ConvNets when networks are constrained to be compact, i.e. when computational load at run-time is limited. This complies with our theoretical analysis in sec. 3, which shows that weighted ℓp similarity exhibits an expressive power that goes beyond kernel machines, whereas linear similarity (the case associated with ConvNets) is fully captured by the Exponential kernel. Asymptotically as the dimension increases, even a simple kernel machine becomes expressive enough for a given problem, and more elaborate expressiveness may actually be a burden, as it aggravates overfitting. Nonetheless, we see in our experiments that with proper regularization, large-scale SimNets achieve accuracies comparable to state of the art ConvNets.

Figure 2. (a) Single layer ConvNet compared against single layer SimNet on CIFAR-10. (b) CIFAR-10 cross-validation accuracies of single-layer networks as a function of the number of floating-point operations required to classify an instance. (c) Caffe ConvNet compared against two layer SimNet on CIFAR-10 and SVHN (for CIFAR-100, number of output units increased from 10 to 100). Best viewed in color.

6.1. Experimental details 

The datasets used in our experiments are CIFAR-10 and CIFAR-100, as well as SVHN. These three datasets together form an image recognition benchmark that is diverse and challenging on one hand, yet simple enough to enable granular controlled experiments such as those needed to evaluate a new architecture. All datasets consist of 32×32 color images. SVHN (Street View House Numbers) represents a rather simple classification benchmark, where various methods are known to produce near-human accuracies. It contains approximately 600K images for training and 26K images for testing, partitioned into 10 categories that correspond to the digits 0 through 9. CIFAR-100 contains 50K images for training and 10K images for testing, equally partitioned into 100 categories. With a relatively large number of categories, and only a few hundred training examples per class, CIFAR-100 represents a challenging classification task. CIFAR-10 contains 50K images for training and 10K images for testing, equally partitioned into 10 categories. It strikes a balance between the simplicity of SVHN and the complexity of CIFAR-100, and accordingly served as the central dataset throughout our experiments. Namely, all cross-validations were carried out on CIFAR-10 (with 10K training images held out for validation), with SVHN and CIFAR-100 used for final evaluation only. In terms of implementation, we have integrated SimNets into the Caffe toolbox, with the aim of making our code publicly available in the near future.

In all our experiments, we trained both SimNets and ConvNets by minimizing softmax loss using SGD with Nesterov acceleration. Batch size, momentum, weight decay and learning rate were chosen through cross-validation, though we observed, at least for the case of SimNets, that the following choices consistently produced good results: batch size 128, momentum 0.9, weight decay 0.0001 and learning rate 0.01 decreasing by a factor of 10 after 200 and 250 epochs (out of 300 total). Unlike ConvNets, which are mostly initialized randomly nowadays, SimNets are naturally pre-trained using statistical estimation methods (sec. 5). For computational efficiency, we implemented stochastic versions of these algorithms. Unless otherwise stated, all reported SimNet results were obtained using its pre-training scheme.
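The learning-rate schedule described above (0.01, reduced by a factor of 10 after 200 and after 250 epochs) can be written as a small step function. This is a sketch of the schedule only, not of the training code; the function name and the convention that epochs are counted from zero are assumptions.

```python
def learning_rate(epoch, base_lr=0.01, drops=(200, 250), factor=10.0):
    """Step schedule: divide the rate by `factor` once each epoch count in `drops`
    has been completed (epochs are assumed to be indexed from 0)."""
    n_drops = sum(epoch >= d for d in drops)
    return base_lr / factor ** n_drops

# Rates at the start, after the first drop, and after the second drop.
schedule = [learning_rate(e) for e in (0, 200, 250)]
```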

6.2. Single layer SimNet 

As an initial experiment we compared a single layer SimNet, i.e. a SimNet MLPConv with whitened ℓp similarity (conv→ℓp-sim), to an equivalent single layer ConvNet defined for this purpose. We chose to design the ConvNet in accordance with the prescription given by Coates et al. in their study of single layer networks. The resulting network is illustrated in fig. 2(a). As can be seen, it includes a single convolutional layer with 5×5 receptive field and ReLU activation, followed by max pooling over quadrants and dense linear classification. To align the SimNet with this structure, we applied the whitened similarity to patches with spatial size 5×5, and since these have relatively low dimension already (75), we did not reduce it further during whitening.

To compare the networks as they vary in size (and run-time complexity), we set the number of convolution/similarity channels (denoted $n$ in fig. 2(a) and fig. 1(c)) to 50, 100, 200, 400 and 800. Since the ConvNet requires fewer computations for a given number of channels, we also tried it with 1600 and 3200 channels. CIFAR-10 cross-validation accuracies produced by the ConvNet, the SimNet with ℓ1 similarity, and the SimNet with ℓ2 similarity, are plotted in fig. 2(b) against the number of FLOPs (floating-point operations) required to classify an image.^4 ^5 As can be seen, for a given computational budget, the accuracies of ℓ1 and ℓ2 SimNets are comparable, whereas the ConvNet falls significantly behind.


6.3. Two layer SimNet 

The purpose of this second experiment was to compare SimNets against the best publicly available compact ConvNet we could find. We are interested in a clean SimNet vs. ConvNet architectural comparison, and thus did not include in the experiment model compression techniques such as those listed in sec. 1 (e.g. FitNets [29]), which may be applied to both architectures. An additional reason to exclude these techniques, as well as other works dealing with compact ConvNets, is that all results they report relate to networks that are significantly larger than those we are interested in evaluating, in many cases too large to fit a real-time mobile application. With the stated purpose of this experiment being a comparison against an off-the-shelf ConvNet that was not altered by us, we eventually chose to work against the compact CIFAR-10 ConvNet that comes built-in to Caffe, the structure of which is illustrated in fig. 2(c). As the figure shows, the network includes three 5×5 convolutions, each followed by ReLU activation and pooling. Two dense linear layers (separated by ReLU) map the last convolutional layer into network outputs (class scores). The SimNet to which we compared Caffe ConvNet is a two layer network that follows the general structure outlined in fig. 1(d), with ℓ2 similarity and architectural choices taken to maximize the alignment with Caffe ConvNet: 5×5 receptive field and 32 channels in the first similarity layer, 5×5 receptive field and 64 channels in the second similarity layer, and MEX pooling between the similarities fixed to 3×3 max pooling with stride 2.

^4 In this paper, we consider FLOPs to be a measure of computational complexity. We do not compare actual run-times, as our implementation of SimNets is relatively naive, not nearly as efficient as the highly optimized ConvNet code that comes built-in to Caffe. One may argue that like Caffe, many other hardware or software platforms are specifically designed for convolutions, and therefore ConvNets have a computational edge over SimNets. While this is true for some off-the-shelf systems, our goal in this paper is to address inherent algorithmic complexities, not specific platforms currently in the market.

^5 To circumvent the computational price of exp and log functions included in SimNets, we used approximations that require up to 10 FLOPs per operation. The resulting degradation in accuracy is marginal.

Dataset     Network            Acc. (%)   FLOPs    Params
CIFAR-10    Caffe ConvNet      81.1       24.8M    145.6K
CIFAR-10    Two layer SimNet   85.5       14.2M    64.6K
SVHN        Caffe ConvNet      94         24.8M    145.6K
SVHN        Two layer SimNet   93.8       14.2M    64.6K
CIFAR-100   Caffe ConvNet      52.4       24.8M    151.4K
CIFAR-100   Two layer SimNet   54.6       14.6M    70.3K

Table 1. Two layer SimNet vs. Caffe ConvNet on CIFAR-10, SVHN and CIFAR-100 - comparison of test accuracies, number of floating-point operations required to classify an image, and number of learned parameters.

The networks were initially evaluated on CIFAR-10.
Training hyper-parameters for the SimNet were configured
via cross-validation, whereas for Caffe ConvNet we used
the values that come built-in to Caffe. After measuring
CIFAR-10 test accuracies, the same settings (network
architectures and training hyper-parameters) were used to
evaluate test accuracies on SVHN. For evaluation of test
accuracies on CIFAR-100, we again used the exact same settings
as in CIFAR-10, but this time increased the number of
output channels in both networks from 10 to 100. The results of
this experiment are summarized in table 1. As can be seen,
the SimNet is roughly twice as efficient as Caffe ConvNet
(about 1.7x fewer FLOPs and 2.2x fewer parameters),
yet achieves significantly higher accuracies on the more
challenging benchmarks (CIFAR-10 and CIFAR-100). On
SVHN accuracies are comparable, the reason being that in
this simple benchmark classification error is dominated by
overfitting, to which the enhanced expressiveness of SimNets
does not contribute.
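
The "roughly twice as efficient" reading can be checked directly against Table 1. The short sketch below copies the table's figures and computes the accuracy gain and the FLOP and parameter ratios per dataset; the names `table1` and `compare` are illustrative only.

```python
# Figures copied from Table 1: (accuracy %, FLOPs in millions, parameters in thousands).
table1 = {
    "CIFAR-10":  {"conv": (81.1, 24.8, 145.6), "sim": (85.5, 14.2, 64.6)},
    "SVHN":      {"conv": (94.0, 24.8, 145.6), "sim": (93.8, 14.2, 64.6)},
    "CIFAR-100": {"conv": (52.4, 24.8, 151.4), "sim": (54.6, 14.6, 70.3)},
}

def compare(dataset):
    """Return SimNet accuracy gain and ConvNet/SimNet cost ratios for one dataset."""
    acc_c, flop_c, par_c = table1[dataset]["conv"]
    acc_s, flop_s, par_s = table1[dataset]["sim"]
    return {
        "acc_gain": round(acc_s - acc_c, 1),        # percentage points
        "flop_ratio": round(flop_c / flop_s, 2),    # > 1 means SimNet is cheaper
        "param_ratio": round(par_c / par_s, 2),
    }

for name in table1:
    print(name, compare(name))
```

On CIFAR-10 this gives a gain of 4.4 points at a FLOP ratio of 1.75 and a parameter ratio of 2.25, which is the sense in which the SimNet is about twice as efficient.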

6.4. Three layer SimNet 

In the previous experiments we have seen that SimNets
are more accurate than ConvNets when networks are
constrained to be compact, i.e. when classification run-time
is limited. In such a setting, the lower approximation
error of SimNets plays an important role. In contrast, when
networks are over-specified (i.e. much larger than necessary
to model the problem at hand), which is standard
practice for achieving state of the art accuracy, the
approximation error is virtually zero, and the advantage of the
SimNet architecture fades. Moreover, the additional expressive
power of SimNets could actually be a burden, as additional
regularization for controlling overfitting would be required. It
is therefore of interest to explore the ability of SimNets to
reach state of the art accuracy with over-specified networks.
This is the aim of our third and final experiment, carried out
on CIFAR-10.

In this experiment we used a three layer SimNet as
described in fig. 2(d), with the following architectural choices
(determined via cross-validation): l_2 similarities; 192
similarity channels in all three layers with receptive field sizes
5x5, 5x5 and 3x3 (respectively); max pooling after layer
1, average pooling after layer 2, in both cases with pooling
windows 3x3 in size and stride 2 between them. We trained
the network with basic data augmentation, and regularized
using multiplicative Gaussian noise^ in conv→l_p-sim layers.
We did not make use of ensembles (||6l) or aggressive
data augmentation that includes rescaling images (01).
These practices are known to improve accuracy, but are
orthogonal to the SimNet vs. ConvNet distinction. We did
not include them in our study in order to facilitate a simpler
comparison between the two architectures. Table 2 draws a
comparison between the test accuracy reached by the SimNet
and reported state of the art results that did not make
use of ensembles or aggressive data augmentation. As the
table shows, SimNets are comparable to state of the art
ConvNets, even in the over-specified setting.
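
For readers unfamiliar with the regularizer mentioned above, the following is a minimal sketch of multiplicative Gaussian noise in its usual form, where each activation is scaled by an independent draw from N(1, sigma^2) during training. The noise level `sigma` and the function name are assumptions; the text does not state the value used or where exactly the noise was applied within a layer.

```python
import numpy as np

def multiplicative_gaussian_noise(x, sigma, rng, training=True):
    """Scale each activation by an independent sample from N(1, sigma^2).

    The noise has mean 1, so the expected activation is unchanged and
    no rescaling is needed at test time.
    """
    if not training:
        return x
    noise = rng.normal(loc=1.0, scale=sigma, size=x.shape)
    return x * noise

rng = np.random.default_rng(0)
activations = np.ones((1000, 50))
noisy = multiplicative_gaussian_noise(activations, sigma=0.5, rng=rng)
```

Unlike dropout, this never sets a coordinate exactly to zero; it perturbs every coordinate around its original value.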

As a final sanity check, we compared extremely compact
versions of our three layer SimNet and Network in Network
(NiN, [26])^. Specifically, we changed the number of
channels in all layers of both networks to 10, and removed
dropout (NiN) and multiplicative Gaussian noise (SimNet),
leaving all other hyper-parameters intact. The resulting
networks had only 5K parameters each, and required just 3.5M
FLOPs to classify an image. With such limited resources we
expect the SimNet to benefit from its inherent expressiveness,
and indeed, it outperformed NiN significantly,
reaching 76.8% accuracy compared to 72.3% for NiN.

7. Conclusion 

We presented a deep layered architecture called SimNets
that generalizes convolutional neural networks. The
architecture is driven by two operators: (i) the similarity
operator, which is a generalization of the inner-product operator
on which ConvNets are based, and (ii) the MEX operator,
which can realize non-linear activation and pooling, but has
additional capabilities that make SimNets a powerful
generalization of ConvNets. An interesting property of the
SimNet architecture is that applying its two operators in
succession (similarity followed by MEX) results in what can be
viewed as an artificial neuron in a high-dimensional feature

^This regularization technique was shown to be more effective than
dropout ([33]), and better suits the nature of SimNets (zeroing out an input
coordinate does not neutralize its effect on l_p similarity).

^We chose to work against NiN since it bears an architectural resem¬ 
blance to our SimNet, thus it was clear how both networks can be made 
compact in an analogous way. 


Method                         Acc. (%)
Network in Network [26]        91.19
Deeply Supervised Nets [25]    92.03
Highway Network [34]           92.4
ALL-CNN [32]                   92.75
Three layer SimNet             92.18


Table 2. Three layer SimNet vs. state of the art ConvNets on 
CIFAR-10 (ensemble and aggressive data augmentation methods 
excluded) - comparison of test accuracies. 


space (sec. |^. This also holds for the more elaborate
image processing SimNet incorporating locality, sharing and
pooling (sec. 4.1).
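
To make the role of MEX concrete, here is a minimal sketch of a log-mean-exp operator, assuming the form MEX_beta(c_1, ..., c_n) = (1/beta) log((1/n) sum_i exp(beta c_i)). The parameter name `beta` and the function name `mex` are assumptions made for this illustration. Under this form the operator approaches the maximum for large positive beta, the average as beta tends to zero, and the minimum for large negative beta.

```python
import numpy as np

def mex(c, beta):
    """Log-mean-exp of the values in c with sharpness parameter beta.

    beta -> +inf approaches max(c), beta -> 0 approaches mean(c),
    beta -> -inf approaches min(c).
    """
    c = np.asarray(c, dtype=float)
    if beta == 0:
        return float(c.mean())  # limiting case, handled explicitly
    z = beta * c
    m = z.max()  # subtract the largest exponent for numerical stability
    return float((m + np.log(np.mean(np.exp(z - m)))) / beta)

values = [1.0, 2.0, 3.0]
print(mex(values, 1000.0))   # close to the maximum
print(mex(values, 1e-6))     # close to the average
print(mex(values, -1000.0))  # close to the minimum
```

A single operator with a tunable (or learned) beta can therefore stand in for both max pooling and average pooling.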

The feature spaces realized by SimNets depend on the
choice of similarity type: linear or l_p, with or without weights.
We have shown that the simplest setting using linear
similarity (corresponding to regular convolution) realizes the
feature space of the Exponential kernel, while l_p settings
realize feature spaces of more powerful kernels (Generalized
Gaussian, which includes as special cases RBF and Laplacian),
or even dynamically learned feature spaces (Generalized
Multiple Kernel Learning). These observations suggest
that SimNets, when equipped with l_p similarity, have a higher
abstraction level than ConvNets, which correspond to linear
similarity.
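
The link between similarity type and kernel can be illustrated numerically. The sketch below assumes that linear similarity is the inner product of an input patch x with a template z, and that l_p similarity is the negative (optionally weighted) sum of |x_i - z_i|^p. These forms and the function names are assumptions for illustration. Exponentiating the linear similarity gives the Exponential kernel exp(x·z); exponentiating the l_2 and l_1 similarities gives kernels of RBF and Laplacian type.

```python
import numpy as np

def linear_similarity(x, z):
    """Inner product between input x and template z (the convolution case)."""
    return float(np.dot(x, z))

def lp_similarity(x, z, p, u=None):
    """Negative weighted sum of |x_i - z_i|**p; u=None means unit weights."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(z, dtype=float)) ** p
    if u is not None:
        diff = np.asarray(u, dtype=float) * diff
    return -float(diff.sum())

x, z = [1.0, 2.0], [0.0, 4.0]
exponential_kernel = np.exp(linear_similarity(x, z))  # exp(<x, z>)
rbf_kernel = np.exp(lp_similarity(x, z, p=2))         # exp(-||x - z||_2^2)
laplacian_kernel = np.exp(lp_similarity(x, z, p=1))   # exp(-||x - z||_1)
```

With learned weights `u`, each coordinate gets its own bandwidth, so the kernel shape itself becomes trainable.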

We argue that a higher abstraction level for the basic net¬ 
work building blocks carries with it the advantage of ob¬ 
taining higher accuracies with small networks, an impor¬ 
tant trait for mobile and real-time applications. Through 
a detailed set of experiments we validated the conjecture 
of higher accuracy for small networks, and we have also 
shown that SimNets can achieve state of the art accuracy in 
large-scale settings where computational efficiency is not a 
concern (and thus the higher abstraction per given network 
size is not an advantage). 

Finally, the SimNet architecture is endowed with a nat¬ 
ural pre-training scheme based on unlabeled data. Besides 
its aid in training, the scheme also has the potential of de¬ 
termining the number of channels in hidden layers based on 
statistical analysis of patterns generated in previous layers. 
This implies that the structure of SimNets can potentially 
be determined automatically based on (unlabeled) train¬ 
ing data. Future work includes a study of this capability, 
and more generally, further analysis of probabilistic prop¬ 
erties of SimNets and unsupervised/supervised algorithms 
derived thereof. 